For the record, I have been a pescetarian (vegetarian + seafood) for some years, but I am sometimes tempted to sin. At a highly Yelp-rated chicken place this summer, I was waiting for my order, which took forever (an exaggeration). With nothing better to do, I re-studied the menu, trying to understand the pricing structure. See photo:
So I found myself asking a ton of questions: How can I get the best deal? What is the cost per wing, and per drum? Does the restaurant use a formula to set the prices? And if I add a sinfulness parameter as a penalty term, how much should I order next time to minimize my sin (Root-Mean-Squared Sinfulness)?
Dang! I am cursed. Shouldn’t have come here, I thought…
You can see the set prices clearly. We will need to reformat the data, and we can practice our Markdown skills at the same time by tabulating them here:
| Wings | Drums | Price |
|---|---|---|
| 5 | 0 | 6.69 |
| 10 | 0 | 12.99 |
| 15 | 0 | 17.99 |
| 20 | 0 | 23.99 |
| 40 | 0 | 45.99 |
| 0 | 3 | 6.69 |
| 0 | 5 | 10.99 |
| 0 | 10 | 19.99 |
| 0 | 15 | 29.99 |
| 3 | 2 | 7.99 |
| 7 | 4 | 16.99 |
| 12 | 6 | 27.99 |
| 20 | 9 | 43.99 |
Let us now enter these into a data frame in R. I will echo the code here, but in general it's better not to, unless the code itself has something to show the audience.
# fried-chicken-price
fcp <- data.frame(
  wing  = c(NA, 10, 15, 20, 40, 0, 0, 0, 0, 3, 7, 12, 20),
  drum  = c(0, 0, 0, 0, 0, 3, 5, 10, 15, 2, 4, 6, 9),
  price = c(6.69, 12.99, 17.99, 23.99, 45.99, 6.69, 10.99, 19.99, 29.99, 7.99, 16.99, 27.99, 43.99)
)
So there are 13 data points in our dataset. (Notice the use of inline R code here.)
There are a few things we typically look at for exploratory data analysis (EDA).
str(fcp)
## 'data.frame': 13 obs. of 3 variables:
## $ wing : num NA 10 15 20 40 0 0 0 0 3 ...
## $ drum : num 0 0 0 0 0 3 5 10 15 2 ...
## $ price: num 6.69 12.99 17.99 23.99 45.99 ...
summary(fcp)
## wing drum price
## Min. : 0.00 Min. : 0.000 Min. : 6.69
## 1st Qu.: 0.00 1st Qu.: 0.000 1st Qu.:10.99
## Median : 8.50 Median : 3.000 Median :17.99
## Mean :10.58 Mean : 4.154 Mean :20.94
## 3rd Qu.:16.25 3rd Qu.: 6.000 3rd Qu.:27.99
## Max. :40.00 Max. :15.000 Max. :45.99
## NA's :1
Upon inspection, we immediately see an NA entry for wing (it should be the 5-wing order). That's a data-entry mistake. Let's fix it and re-run the EDA.
fcp[1, 1] <- 5  # for assignment, "=" and "<-" are mostly interchangeable in R, though "<-" is the conventional style
str(fcp)
## 'data.frame': 13 obs. of 3 variables:
## $ wing : num 5 10 15 20 40 0 0 0 0 3 ...
## $ drum : num 0 0 0 0 0 3 5 10 15 2 ...
## $ price: num 6.69 12.99 17.99 23.99 45.99 ...
summary(fcp)
## wing drum price
## Min. : 0.00 Min. : 0.000 Min. : 6.69
## 1st Qu.: 0.00 1st Qu.: 0.000 1st Qu.:10.99
## Median : 7.00 Median : 3.000 Median :17.99
## Mean :10.15 Mean : 4.154 Mean :20.94
## 3rd Qu.:15.00 3rd Qu.: 6.000 3rd Qu.:27.99
## Max. :40.00 Max. :15.000 Max. :45.99
Since the features/variables are all numerical (quantitative), it makes most sense to check their linear correlations.
loadPkg("corrplot")  # loadPkg() is a custom wrapper; library(corrplot) would also work
corrmatrix <- cor(fcp)  # a more detailed pairwise correlation test is available via cor.test(fcp$wing, fcp$price), etc.
corrplot.mixed(corrmatrix)
It is good to see that price has a decent correlation with both wing and drum, with wing showing the higher correlation. It is also great that drum and wing are not too strongly correlated with each other. Imagine if every drum I bought came bundled with 2 dog bones: I would have a hard time knowing whether I was really paying for the drums or for the dog bones. This is the problem of collinearity. (We'll check it with VIF later.)
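As the comment in the code above notes, a formal pairwise test is available via `cor.test()`. A minimal sketch for the wing–price pair:

```r
# Pairwise correlation test between wing count and price.
# Reports the sample correlation, a p-value for H0: true correlation = 0,
# and a 95% confidence interval.
wing_price_test <- cor.test(fcp$wing, fcp$price)
wing_price_test$estimate  # sample correlation coefficient
wing_price_test$p.value   # small p-value: correlation unlikely to be zero
```

The same call with `fcp$drum` tests the drum–price pair.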
It is usually best if all the (numerical) variables are normally distributed. That rarely happens exactly, but we would still like each distribution to be roughly bell-shaped and not too skewed.
We can first check the Q-Q plot for each variable by itself. Points falling along a straight line suggest a normal distribution.
qqnorm(fcp$price, main = "price Q-Q Plot")
qqline(fcp$price)
qqnorm(fcp$wing, main = "wing Q-Q Plot")
qqline(fcp$wing)
qqnorm(fcp$drum, main = "drum Q-Q Plot")
qqline(fcp$drum)
The straight-line fits are not great, but with only 13 data points, they're probably acceptable.
Next we can check the boxplots for a rough visual.
boxplot(fcp, col = c("red", "blue", "green"), ylab = "count or price($)")
axis(side = 4)                         # add a right-hand axis for the price scale
mtext("price($)", side = 4, line = 2)  # axis() has no ylab argument; label it with mtext()
Same conclusion: they do not look normal, but we'll take it.
Now histograms:
hist(fcp$price)
hist(fcp$wing)
hist(fcp$drum)
And finally, using the Shapiro-Wilk test:
shapiro.test(fcp$price)
##
## Shapiro-Wilk normality test
##
## data: fcp$price
## W = 0.89892, p-value = 0.1293
shapiro.test(fcp$wing)
##
## Shapiro-Wilk normality test
##
## data: fcp$wing
## W = 0.83692, p-value = 0.01936
shapiro.test(fcp$drum)
##
## Shapiro-Wilk normality test
##
## data: fcp$drum
## W = 0.85207, p-value = 0.03033
We'll learn to interpret these formally soon. (In short: the null hypothesis of the Shapiro-Wilk test is that the data are normally distributed.)
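The three p-values above can also be collected programmatically; at the usual α = 0.05 level, a p-value below 0.05 rejects normality. A minimal sketch:

```r
# Extract the Shapiro-Wilk p-value for each column of fcp.
# p < 0.05 suggests the variable is not normally distributed.
pvals <- sapply(fcp[c("wing", "drum", "price")],
                function(x) shapiro.test(x)$p.value)
round(pvals, 4)
# wing (0.0194) and drum (0.0303) reject normality; price (0.1293) does not.
```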
Now that those pass the smell test, we can try building some models. A linear model is the first to come to mind. We will model price with wing and drum as the two independent variables (also called features).
chicklm = lm(price ~ ., data=fcp)
summary(chicklm)
##
## Call:
## lm(formula = price ~ ., data = fcp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.30661 -0.30051 -0.04708 0.15236 1.84476
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.77755 0.50378 1.543 0.154
## wing 1.16298 0.02546 45.670 6.10e-13 ***
## drum 2.01202 0.06202 32.441 1.83e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9386 on 10 degrees of freedom
## Multiple R-squared: 0.9957, Adjusted R-squared: 0.9949
## F-statistic: 1165 on 2 and 10 DF, p-value: 1.426e-12
As we can see, the R\(^2\) (using \(\LaTeX\) formatting here) value of the linear model is 0.9957, which suggests the prices are set with a rather strict per-piece structure.
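To see where that number comes from, R\(^2\) can be reproduced by hand from the residuals, using the identity \(R^2 = 1 - SS_{res}/SS_{tot}\):

```r
# R-squared from first principles:
# 1 - (residual sum of squares) / (total sum of squares about the mean)
ss_res <- sum(residuals(chicklm)^2)
ss_tot <- sum((fcp$price - mean(fcp$price))^2)
1 - ss_res / ss_tot  # matches the Multiple R-squared of 0.9957 in the summary
```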
If we use the (default 95%) confidence intervals of the coefficients as shown here:
coeffconfint = confint.lm(chicklm)
# or just
# confint(chicklm)
coeffconfint
## 2.5 % 97.5 %
## (Intercept) -0.3449566 1.900048
## wing 1.1062375 1.219715
## drum 1.8738264 2.150211
We find that:

- each wing costs between roughly $1.11 and $1.22 (95% CI), with a point estimate of about $1.16;
- each drum costs between roughly $1.87 and $2.15, with a point estimate of about $2.01;
- the intercept's interval includes zero, so there is no clear evidence of a fixed base charge.
We should always check for multi-collinearity in linear models.
loadPkg("faraway")  # the faraway library is one of several that provide a vif() function
vif(chicklm)
## wing drum
## 1.186424 1.186424
With VIF values well below 5, we can safely conclude there is little collinearity concern in the dataset.
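With the model validated, we can also price an arbitrary order. As a sketch, the order size below (8 wings and 3 drums) is a made-up example:

```r
# Predict the price of a hypothetical order of 8 wings and 3 drums,
# with a 95% prediction interval for a single new observation.
new_order <- data.frame(wing = 8, drum = 3)
predict(chicklm, newdata = new_order, interval = "prediction")
# point estimate: about 0.778 + 8(1.163) + 3(2.012), i.e. roughly $16.12
```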
So what is the best deal???
Let us compare the model's predicted values with the actual values, i.e., look at the residuals.
fcp$fitval = chicklm$fitted.values
fcp$residuals = chicklm$residuals
fcp
## wing drum price fitval residuals
## 1 5 0 6.69 6.592428 0.09757171
## 2 10 0 12.99 12.407311 0.58268922
## 3 15 0 17.99 18.222193 -0.23219327
## 4 20 0 23.99 24.037076 -0.04707576
## 5 40 0 45.99 47.296606 -1.30660571
## 6 0 3 6.69 6.813602 -0.12360174
## 7 0 5 10.99 10.837639 0.15236097
## 8 0 10 19.99 20.897732 -0.90773226
## 9 0 15 29.99 30.957825 -0.96782549
## 10 3 2 7.99 8.290513 -0.30051259
## 11 7 4 16.99 16.966456 0.02354413
## 12 12 6 27.99 26.805376 1.18462435
## 13 20 9 43.99 42.145244 1.84475643
Ah, the residual is most negative for the 40-wing order, at −$1.31 below the fitted price. I should go for that and kill myself.
But if I am wiser and insist on a one-person portion, the winner is the 3-wing + 2-drum combo: at $7.99, it comes in about $0.30 below the fitted price, the biggest discount among the small orders.
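To make that concrete, we can rank the smaller orders by residual (actual minus fitted); the most negative residual is the biggest discount relative to the per-piece pricing. A sketch, where "5 or fewer total pieces" is my made-up cutoff for a one-person portion:

```r
# Rank one-person-sized orders (here: at most 5 total pieces) by residual;
# the most negative residual is the best bargain versus the fitted price.
small <- fcp[fcp$wing + fcp$drum <= 5, ]
small[order(small$residuals), c("wing", "drum", "price", "residuals")]
```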
Case closed. Mic dropped.
Oh wait… need to include the sin penalty…
And the wings there were really really good.